Abstract

We propose a novel approach to reduce the computational cost of evaluation of 
convolutional neural networks, a factor that has hindered their deployment in low- 
power devices such as mobile phones. Inspired by the loop perforation technique 
from source code optimization, we speed up the bottleneck convolutional layers by 
skipping their evaluation in some of the spatial positions. We propose and analyze 
several strategies of choosing these positions. We demonstrate that perforation 
can accelerate modern convolutional networks such as AlexNet and VGG-16 by a 
factor of 2x-4x. Additionally, we show that perforation is complementary to the
recently proposed acceleration method of Zhang et al. [28].

1 Introduction 

The last few years have seen convolutional neural networks (CNNs) emerge as an indispensable tool 
for computer vision. However, modern CNNs have a high computational cost of evaluation, with 
convolutional layers usually taking up over 80% of the time. For instance, the VGG-16 network [25] for
the problem of object recognition requires $1.5 \cdot 10^{10}$ floating point multiplications per image. These
computational requirements hinder the deployment of such networks on systems without GPUs and 
in scenarios where power consumption is a major concern, such as mobile devices. 

The problem of trading accuracy of computations for speed is well-known within the software 
engineering community. One of the most prominent methods for this problem is loop perforation
[18, 19, 24]. In a nutshell, this technique isolates loops in the code that are not critical for the execution, and
then reduces their computational cost by skipping some iterations. More recently, researchers have
considered problem-dependent perforation strategies that exploit the structure of the problem [23].

Inspired by the general principle of perforation, we propose to reduce the computational cost of CNN 
evaluation by exploiting the spatial redundancy of the network. Modern CNNs, such as AlexNet, 
exploit this redundancy through the use of strides in the convolutional layers. However, using the 
convolutional strides changes the architecture of the network (intermediate representations size and 
the number of weights in the first fully-connected layer), which might be undesirable. Instead of 
using strides, we argue for the use of interpolation (perforation) of responses in the convolutional 
layer. A key element of this approach is the choice of the perforation mask, which defines the output 
positions to evaluate exactly. We propose several approaches to select the perforation masks and a 
method of choosing a combination of perforation masks for different layers. To restore the network 
accuracy, we perform fine-tuning of the perforated network. Our experiments show that this method 
can reduce the evaluation time of modern CNN architectures proposed in the literature by a factor of
2x-4x with a small decrease in accuracy.

2 Related Work 

Reducing the computational cost of CNN evaluation is an active area of research, with both highly 
optimized implementations and approximate methods investigated. 

Figure 1: Reduction of convolutional layer evaluation to matrix multiplication. Our idea is to leave
only a subset of rows (defined by a perforation mask) in the data matrix M and to interpolate the 
missing output values. 


Implementations that exploit the parallelism available in computational architectures like GPUs
(cuda-convnet2, CuDNN) have significantly reduced the evaluation time of CNNs.
Since CuDNN internally reduces the computation of convolutional layers to matrix-by-matrix
multiplication (without explicitly materializing the data matrix), our approach can potentially be
incorporated into this library. In a similar vein, the use of FPGAs [22] leads to better trade-offs
between speed and power consumption. Several papers showed that CNNs may be
efficiently evaluated using low-precision arithmetic, which is important for FPGA implementations.

Most approximate methods of decreasing the CNN computational cost exploit the redundancies of
the convolutional kernel using low-rank tensor decompositions [6, 10, 16, 28]. In most cases, a
convolutional layer is replaced by several convolutional layers applied sequentially, which have a
much lower total computational cost. We show that the combination of perforation with the method
of Zhang et al. [28] improves upon both approaches.

For spatially sparse inputs, it is possible to exploit this sparsity to speed up evaluation and training [8].
While this approach is similar to ours in the spirit, we do not rely on spatially sparse inputs. Instead, 
we sparsely sample the outputs of a convolutional layer and interpolate the remaining values. 

In a recent work, Lebedev and Lempitsky [15] also decrease the CNN computational cost by reducing
the size of the data matrix. The difference is that their approach reduces the convolutional kernel’s 
support while our approach decreases the number of spatial positions in which the convolutions are 
evaluated. The two methods are complementary. 

Several papers have demonstrated that it is possible to compress the parameters of the fully-connected 
layers (where most CNN parameters reside) with a marginal error increase [4]. Since our
method does not directly modify the fully-connected layers, it is possible to combine these methods 
with our approach and obtain a fast and small CNN. 

3 PerforatedCNNs 

The section provides a detailed description of our approach. Before proceeding further, we introduce 
the notation that will be used in the rest of the paper. 

Notation. A convolutional layer takes as input a tensor $U$ of size $X \times Y \times S$ and outputs a tensor
$V$ of size $X' \times Y' \times T$, $X' = X - d + 1$, $Y' = Y - d + 1$. The first two dimensions are spatial
(height and width), and the third dimension is the number of channels (for example, for an RGB input
image $S = 3$). The set of $T$ convolution kernels $K$ is given by a tensor of size $d \times d \times S \times T$. For
simplicity of notation, we assume unit stride, no zero-padding and skip the biases. The convolutional
layer output may be defined as follows:

$$V(x, y, t) = \sum_{i=1}^{d} \sum_{j=1}^{d} \sum_{s=1}^{S} K(i, j, s, t) \, U(x + i - 1, y + j - 1, s) \tag{1}$$

Additionally, we define the set of all spatial indices (positions) of the output $\Omega = \{1, \dots, X'\} \times
\{1, \dots, Y'\}$. Perforation mask $I \subseteq \Omega$ is the set of indices in which the outputs are calculated exactly.
Denote by $N = |I|$ the number of positions to be calculated exactly, and by $r = 1 - N / |\Omega|$ the perforation
rate.
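
As a quick worked illustration of this notation (a sketch, assuming the perforation rate is the fraction of skipped output positions, $r = 1 - N / |\Omega|$, and counting only the multiplications of the convolution itself), the theoretical speedup of a perforated layer is the ratio of all positions to the positions evaluated exactly:

$$\text{theoretical speedup} = \frac{|\Omega|}{N} = \frac{1}{1 - r}, \qquad \text{e.g. } r = 0.75 \;\Rightarrow\; \frac{1}{1 - 0.75} = 4.$$

This agrees with the setting of table 5, where $r = 0.75$ corresponds to a 4x theoretical speedup.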

Reduction to matrix multiplication. To achieve high computational performance, many deep learning
frameworks, including Caffe and MatConvNet [26], reduce the computation of convolutional
layers to the heavily-optimized matrix-by-matrix multiplication routine of basic linear algebra packages.
This process, sometimes referred to as lowering, is illustrated in fig. 1. First, a data matrix $M$
of size $X'Y' \times d^2 S$ is constructed using the im2row function. The rows of $M$ are elements of patches
of the input tensor $U$ of size $d \times d \times S$. Then, $M$ is multiplied by the kernel tensor $K$ reshaped into size
$d^2 S \times T$. The resulting matrix of size $X'Y' \times T$ is the output tensor $V$, up to a reshape. For a more
detailed exposition, see [26].
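
The lowering step described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the Caffe or MatConvNet implementation: the function names `im2row` and `conv_lowered` are chosen here for illustration, indices are 0-based, and the ordering of patch elements inside a row (row-major over i, j, s) is an assumption that only has to agree with the reshape of K.

```python
import numpy as np

def im2row(U, d):
    """Data matrix M of size (X'Y') x (d*d*S) built from an input U of size X x Y x S."""
    X, Y, S = U.shape
    Xo, Yo = X - d + 1, Y - d + 1          # X', Y' for unit stride and no padding
    M = np.empty((Xo * Yo, d * d * S), dtype=U.dtype)
    for x in range(Xo):
        for y in range(Yo):
            # one row per output position: the flattened d x d x S input patch
            M[x * Yo + y] = U[x:x + d, y:y + d, :].ravel()
    return M

def conv_lowered(U, K):
    """Convolution of eq. (1), computed as a single matrix product."""
    d, _, S, T = K.shape
    X, Y, _ = U.shape
    V = im2row(U, d) @ K.reshape(d * d * S, T)   # (X'Y') x T
    return V.reshape(X - d + 1, Y - d + 1, T)    # output tensor, up to a reshape
```

The explicit Python loops keep the correspondence with eq. (1) visible; an optimized implementation would build M with vectorized or native code.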

3.1 Perforated convolutional layer 

In this section we present the perforated convolutional layer. In a small fraction of spatial positions, 
the outputs of the proposed layer are equal to the outputs of a usual convolutional layer. The 
remaining values are interpolated using the nearest neighbor from this set of positions. We evaluate 
other interpolation strategies in appendix A.

The perforated convolutional layer is a generalization of the standard convolutional layer. When 
the perforation mask is equal to all the output spatial positions, the perforated convolutional layer’s 
output equals the conventional convolutional layer’s output. 

Formally, let $I \subseteq \Omega$ be the perforation mask of spatial output positions to be calculated exactly (the constraint
that the masks are shared for all channels of the output is required for the reduction to matrix
multiplication). The function $\ell(x, y) : \Omega \to I$ returns the index of the nearest neighbor in $I$ according
to Euclidean distance (with ties broken randomly):

$$\ell(x, y) = (\ell_1(x, y), \ell_2(x, y)) = \arg\min_{(x', y') \in I} \sqrt{(x - x')^2 + (y - y')^2}. \tag{2}$$

Note that the function $\ell(x, y)$ may be calculated in advance and cached.

The perforated convolutional layer output $\hat{V}$ is defined as follows:

$$\hat{V}(x, y, t) = V(\ell_1(x, y), \ell_2(x, y), t), \tag{3}$$

where $V(x, y, t)$ is the output of the usual convolutional layer, defined by (1). Since $\ell(x, y) = (x, y)$
for $(x, y) \in I$, the outputs in the spatial positions $I$ are calculated exactly. The values in other positions
are interpolated using the value of the nearest neighbor. To evaluate a perforated convolutional layer,
we only need to calculate the values $V(x, y, t)$ for $(x, y) \in I$, which can be done efficiently by
reduction to matrix multiplication. In this case, the data matrix $M$ contains just $N = |I|$ rows, instead
of the original $X'Y' = |\Omega|$ rows. Perforation is not limited to this implementation of a convolutional
layer, and can be combined with other implementations that support strided convolutions, such as the
direct convolution approach of cuda-convnet2.
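
A perforated layer with nearest-neighbor interpolation can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' MatConvNet or Caffe code: the names `nearest_index_map` and `perforated_conv` are hypothetical, the mask is a boolean array over the output positions with at least one kept position, and ties in the nearest-neighbor search are broken by the lowest index rather than randomly. Unlike the implicit interpolation of the actual implementation, the sketch materializes the full interpolated output.

```python
import numpy as np

def nearest_index_map(mask):
    """For every output position, the index (into the kept positions) of its nearest kept position."""
    kept = np.argwhere(mask)                                  # N x 2 positions of I, row-major order
    grid = np.moveaxis(np.indices(mask.shape), 0, -1)         # X' x Y' x 2 coordinates of all positions
    d2 = ((grid[:, :, None, :] - kept[None, None, :, :]) ** 2).sum(axis=-1)
    return kept, d2.argmin(axis=-1)                           # ties: lowest index wins (assumption)

def perforated_conv(U, K, mask):
    """Evaluate the convolution only at positions in the mask and copy values elsewhere."""
    d, _, S, T = K.shape
    kept, nn = nearest_index_map(mask)                        # can be computed once and cached
    # data matrix with only N = |I| rows instead of X'Y'
    M = np.stack([U[x:x + d, y:y + d, :].ravel() for x, y in kept])
    V_kept = M @ K.reshape(d * d * S, T)                      # N x T exact outputs
    return V_kept[nn]                                         # X' x Y' x T interpolated output
```

With a mask that keeps every position, `perforated_conv` reduces to the ordinary convolutional layer, which mirrors the generalization property stated above.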

In our implementation, we only store the output values $V(x, y, t)$ for $(x, y) \in I$. The interpolation
is performed implicitly by masking the reads of the following pooling or convolutional layer. For
example, when accelerating the conv3 layer of AlexNet, the interpolation cost is transferred to the conv4
layer. We observe no slowdown of the conv4 layer when using GPU, and a 0-3% slowdown when
using CPU. This design choice has several advantages. Firstly, the memory size required to store
the activations is reduced by a factor of $\frac{1}{1 - r}$. Secondly, the following non-linearity layers and $1 \times 1$
convolutional layers are also sped up since they are applied to a smaller number of elements.

3.2 Perforation masks 


We propose several ways of generating the perforation masks, or choosing $N$ points from $\Omega$. We
visualize the perforation masks $I$ as binary matrices with black squares in the positions of the set $I$.
We only consider the perforation masks that are independent of the input object and leave exploration
of input-dependent perforation masks to future work.


Uniform perforation mask is just $N$ points chosen randomly without replacement from the set $\Omega$.
However, as can be seen from fig. 2a, for $N \ll |\Omega|$ the points tend to cluster. This is undesirable
because a more scattered set $I$ would reduce the average distance to the set $I$.

Grid perforation mask is a set of points $I = \{a(1), \dots, a(K_x)\} \times \{b(1), \dots, b(K_y)\}$, see fig. 2b.
We choose the values of $a(i)$, $b(i)$ using a pseudorandom integer sequence generation scheme.

Pooling structure mask exploits the structure of the overlaps of pooling operators. Denote by $A(x, y)$
the number of times an output of the convolutional layer is used in the pooling operators. The grid-like
pattern as in fig. 2d is caused by a pooling of size $3 \times 3$ with stride 2 (such parameters are used e.g.
in Network in Network and AlexNet). The pooling structure mask is obtained by picking the top-$N$
positions with the highest values of $A(x, y)$, with ties broken randomly, see fig. 2c.
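
The three input-independent masks described so far can be sketched in NumPy as follows. All function names are hypothetical and the sketch makes two simplifying assumptions that are not taken from the paper: the grid mask uses evenly spaced rows and columns instead of a pseudorandom integer sequence, and the pooling weights ignore padding and partial windows at the border.

```python
import numpy as np

def uniform_mask(shape, n, rng):
    """N positions chosen uniformly at random without replacement."""
    flat = np.zeros(shape[0] * shape[1], dtype=bool)
    flat[rng.choice(flat.size, size=n, replace=False)] = True
    return flat.reshape(shape)

def grid_mask(shape, kx, ky):
    """Cartesian product of kx rows and ky columns (evenly spaced here, an assumption)."""
    a = np.unique(np.round(np.linspace(0, shape[0] - 1, kx)).astype(int))
    b = np.unique(np.round(np.linspace(0, shape[1] - 1, ky)).astype(int))
    mask = np.zeros(shape, dtype=bool)
    mask[np.ix_(a, b)] = True
    return mask

def pooling_weights(shape, size=3, stride=2):
    """A(x, y): number of pooling windows that read each convolutional output."""
    A = np.zeros(shape, dtype=int)
    for x in range(0, shape[0] - size + 1, stride):
        for y in range(0, shape[1] - size + 1, stride):
            A[x:x + size, y:y + size] += 1
    return A

def pooling_structure_mask(shape, n, rng, size=3, stride=2):
    """Top-N positions by A(x, y), with ties broken randomly."""
    A = pooling_weights(shape, size, stride).ravel()
    order = np.lexsort((rng.random(A.size), -A))   # primary key: -A, secondary key: random
    mask = np.zeros(A.size, dtype=bool)
    mask[order[:n]] = True
    return mask.reshape(shape)
```

For a 3x3 pooling with stride 2, `pooling_weights` produces the grid-like pattern mentioned above: positions shared by four windows get weight 4, positions shared by two windows get weight 2, and the rest get weight 1.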

Figure 2: Perforation masks, AlexNet conv2, r = 80.25%. Panels: (a) Uniform, (b) Grid, (c) Pooling structure, (d) Weights A(x, y). Best viewed in color.



Figure 3: Top: ImageNet images and corresponding values of impact G(x, y; V) for AlexNet conv2.
Bottom: average impacts and impact perforation mask for AlexNet conv2. Panels: (a) B(x, y), original
network, (b) B(x, y), perforated network, (c) Impact mask, r = 90%. Best viewed in color.


Impact mask estimates the impact of perforation of each position on the CNN loss function, and 
then removes the least important positions. Denote by L(V) the loss function of the CNN (such as 
negative log-likelihood) as a function of the considered convolutional layer outputs V. Next, suppose 
$V'$ is obtained from $V$ by replacing one element $(x_0, y_0, t_0)$ with a neutral value zero. We estimate
the impact of a position as a first-order Taylor approximation of the magnitude of change of L(V): 


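$$|L(V') - L(V)| \approx \left| \sum_{x=1}^{X'} \sum_{y=1}^{Y'} \sum_{t=1}^{T} \frac{\partial L(V)}{\partial V(x, y, t)} \left( V'(x, y, t) - V(x, y, t) \right) \right| = \left| \frac{\partial L(V)}{\partial V(x_0, y_0, t_0)} V(x_0, y_0, t_0) \right|. \tag{4}$$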


The value $\frac{\partial L(V)}{\partial V(x_0, y_0, t_0)}$ may be obtained using backpropagation. In the case of a perforated convolutional
layer, we calculate the derivatives with respect to the convolutional layer output $V$ (not the
interpolated output $\hat{V}$). This makes the impact of the previously perforated positions zero and sums
the impact of the non-perforated positions over all the outputs which share the value.

Since we are interested in the total impact of a spatial position $(x, y) \in \Omega$, we take a sum over all the
channels and average this estimate of impacts over the training dataset:

$$G(x, y; V) = \sum_{t=1}^{T} \left| \frac{\partial L(V)}{\partial V(x, y, t)} V(x, y, t) \right| \tag{5}$$

$$B(x, y) = \mathbb{E}_{V \sim \text{training set}} \, G(x, y; V) \tag{6}$$


Finally, the impact mask is formed by taking the top-$N$ positions with the highest values of $B(x, y)$.
Examples of the values of $G(x, y; V)$, $B(x, y)$ and impact mask are shown in fig. 3. Note that the
regions of the high value of $G(x, y; V)$ usually contain the most salient features of the image. The
averaged weights $B(x, y)$ tend to be higher in the center since ImageNet’s images usually contain a
centered object. Additionally, a grid-like structure of pooling structure mask is automatically inferred.
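
Given layer outputs and their loss gradients, the impact mask computation can be sketched as below. This is an illustrative sketch under stated assumptions: the arrays `V_batch` and `grad_batch` (each of shape examples x X' x Y' x T) are assumed to come from a forward and backward pass of some framework, the function names are hypothetical, and ties between equal impacts are broken by position order.

```python
import numpy as np

def position_impact(V, dL_dV):
    """G(x, y; V): sum over channels of |dL/dV(x, y, t) * V(x, y, t)|."""
    return np.abs(dL_dV * V).sum(axis=-1)

def impact_mask(V_batch, grad_batch, n):
    """Average G over the examples to get B(x, y), then keep the top-n positions."""
    B = np.mean([position_impact(V, g) for V, g in zip(V_batch, grad_batch)], axis=0)
    top = np.argsort(-B, axis=None, kind="stable")[:n]   # flat indices of the largest impacts
    mask = np.zeros(B.size, dtype=bool)
    mask[top] = True
    return mask.reshape(B.shape), B
```

In the iterative procedure described in the text, the impacts would be recomputed on the already perforated network before each increase of the perforation rate.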

Network   Dataset    Error         CPU time   GPU time   Mem.     Mult.         # conv
NIN       CIFAR-10   top-1 10.4%   4.6 ms     0.8 ms     5.1 MB   2.2 · 10^8    3
AlexNet   ImageNet   top-5 19.6%   16.7 ms    2.0 ms     6.6 MB   0.5 · 10^9    5
VGG-16    ImageNet   top-5 10.1%   300 ms     29 ms      110 MB   1.5 · 10^10

Table 1: Details of the CNNs used for the experimental evaluation. Timings, memory consumption
and number of multiplications are normalized by the batch size. Memory consumption is the memory
required to store activations (intermediate results) of the network during the forward pass.



Figure 4: Acceleration of a single layer of AlexNet for different mask types without fine-tuning.
Values are averaged over 5 runs. Panels: (a) conv2, CPU, (b) conv2, GPU, (c) conv3, CPU, (d) conv3, GPU.


Since perforation of a layer changes the impacts of all the layers, in the experiments we iterate 
between increasing the perforation rate of a layer and recalculation of impacts. We find that this 
improves results by co-adapting the perforation masks of different convolutional layers. 

3.3 Choosing the perforation configurations 

For whole network acceleration, it is important to find a combination of per-layer perforation rates that 
would achieve high speedup with low error increase. To do this, we employ a simple greedy strategy. 
We use a single perforation mask type and a fixed range of increasing perforation rates. Denote by t 
the evaluation time of the accelerated network and by e the objective (we use negative log-likelihood 
for a subset of training images). Let $t_0$ and $e_0$ be the respective values for the non-accelerated network.
At each iteration, we try to increase the perforation rate for each layer and choose the layer for which
this results in the minimal value of the cost function $\frac{e - e_0}{t_0 - t}$.
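
The greedy search can be sketched as follows. This is a minimal sketch, not the authors' code. It assumes the cost of a candidate is the loss increase per unit of time saved, (e - e_0) / (t_0 - t). The callback `measure(config)` is hypothetical and returns the evaluation time and objective for a list of per-layer rate indices, with -1 meaning "not perforated". The stopping rule based on a target speedup is also an assumption.

```python
def greedy_perforation(num_layers, rates, measure, target_speedup):
    """Greedily raise per-layer perforation rates until the target speedup is reached."""
    config = [-1] * num_layers                 # index into `rates` per layer, -1 = no perforation
    t0, e0 = measure(config)                   # non-accelerated network
    t = t0
    while t0 / t < target_speedup:
        best = None
        for layer in range(num_layers):
            if config[layer] + 1 >= len(rates):
                continue                       # this layer is already at the highest rate
            cand = list(config)
            cand[layer] += 1                   # try the next higher rate for this layer
            t_c, e_c = measure(cand)
            if t_c >= t0:
                continue                       # no time saved, the cost is undefined
            cost = (e_c - e0) / (t0 - t_c)
            if best is None or cost < best[0]:
                best = (cost, cand, t_c)
        if best is None:
            break                              # no layer can be perforated further
        _, config, t = best
    return [rates[i] if i >= 0 else 0.0 for i in config]
```

Each iteration costs one call to `measure` per layer, so in practice the objective is evaluated on a subset of training images, as the text notes.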

4 Experiments 


We use three convolutional neural networks of increasing size and computational complexity: Network
in Network, AlexNet and VGG-16 [25], see table 1. In all networks, we attempt
to perforate all the convolutional layers, except for the 1x1 convolutional layers of NIN. We
perform timings on a computer with a quad-core Intel Core i5-4460 CPU, 16 GB RAM and an
NVIDIA GeForce GTX 980 GPU. The batch size used for timings is 128 for NIN, 256 for AlexNet
and 16 for VGG-16. The networks are obtained from Caffe Model Zoo. For AlexNet, the Caffe
reimplementation is used which is slightly different from the original architecture (pooling and
normalization layers are swapped). We use a fork of MatConvNet framework for all experiments,
except for fine-tuning of AlexNet and VGG-16, for which we use a fork of Caffe. The
source code is available at https://github.com/mfigurnov/perforated-cnn-matconvnet, 
https://github.com/mfigurnov/perforated-cnn-caffe. 

We begin our experiments by comparing the proposed perforation masks in a common benchmark 
setting: acceleration of a single AlexNet layer. Then, we compare whole-network acceleration 
with the best-performing masks to baselines such as decrease of input images size and an increase 
of strides. We proceed to show that perforation scales to large networks by presenting the whole- 
network acceleration results for AlexNet and VGG-16. Finally, we demonstrate that perforation is 
complementary to the recently proposed acceleration method of Zhang et al. [28].

Method                          CPU time ↓   Error ↑ (%)
Impact, r = …, 3x3 filters      9.1x         +1
Impact, r = …                   5.3x         +1.4
Impact, r = …                   4.2x         +0.9
Lebedev and Lempitsky [15]      20x          top-1 +1.1
Lebedev and Lempitsky [15]      9x           top-1 +0.3
Jaderberg et al. [10]           6.6x         +1
Lebedev et al. [16]             4.5x         +1
Denton et al. [6]               2.7x         +1

Table 2: Acceleration of AlexNet’s conv2. Top: our results after fine-tuning, bottom: previously
published results. Result of [10] provided by [16]. The experiment with reduced spatial size of the
kernel (3x3, instead of 5x5) suggests that perforation is complementary to the “brain damage”
method of [15] which also reduces the spatial support of the kernel.


4.1 Single layer results 

We explore the speedup-error trade-off of the proposed perforation masks on the two bottleneck 
convolutional layers of AlexNet, conv2 and conv3, see fig. 4. The pooling structure perforation
mask is only applicable to the conv2 because it is directly followed by a max-pooling, whereas the 
conv3 is followed by another convolutional layer. We see that impact perforation mask works best 
for the conv2 layer while grid mask performs very well for conv3. The standard deviation of results 
is small for all the perforation masks, except the uniform mask for high speedups (where the grid 
mask outperforms it). The results are similar for both CPU and GPU, showing the applicability of 
our method for both platforms. Note that if we consider the best perforation mask for each speedup 
value, then we see that the conv2 layer is easier to accelerate than the conv3 layer. We observe this 
pattern in other experiments: layers immediately followed by a max-pooling are easier to accelerate 
than the layers followed by a convolutional layer. Additional results for NIN network are presented 
in appendix B.

We compare our results after fine-tuning to the previously published results on the acceleration
of AlexNet’s conv2 in table 2. Motivated by the results of [15] that the spatial support of conv2
convolutional kernel may be reduced with a small error increase, we reduce the kernel’s spatial size
from 5x5 to 3x3 and apply the impact perforation mask. This leads to a 9.1x acceleration for a
1% top-5 error increase. Using the more sophisticated method of [15] to reduce the spatial support
may lead to further improvements.

4.2 Baselines 

We compare PerforatedCNNs with the baseline methods of decreasing the computational cost of 
CNNs by exploiting the spatial redundancy. Unlike perforation, these methods decrease the size of 
the activations (intermediate outputs) of the CNN. For a network with fully-connected (FC) layers, 
this would change the number of CNN parameters in the first FC layer, effectively modifying the 
architecture. To avoid this, we use CIFAR-10 NIN network, which replaces FC layers with global 
average pooling (mean-pooling over all spatial positions in the last layer). 

We consider the following baseline methods. Resize. The input image is downscaled with the aspect 
ratio preserved. Stride. The strides of the convolutional layers are increased, making the activations 
spatially smaller. Fractional stride. Motivated by fractional max-pooling, we introduce a more
flexible modification of strides which evaluates convolutions on a non-regular grid (with a varying 
step size), providing a more fine-grained control over the activations size and speedup. We use grid 
perforation mask generation scheme to choose the output positions to evaluate. 

We compare these strategies to perforation of all the layers with the two types of masks which 
performed best in the previous section: grid and impact. Note that “grid” is, in fact, equivalent to 
fractional strides, but with missing values being interpolated. 

All the methods, except resize, require a parameter value per convolutional layer, leading to a 
large number of possible configurations. We use the original network to explore this space of 
configurations. For impact, we use the greedy algorithm. For stride, we evaluate all possible 
combinations of parameters. For grid and fractional strides, for each layer we consider a fixed set of rates
(for fractional strides this is the fraction of convolutions calculated), and evaluate all
combinations of such rates. Then, for each method, we build a Pareto-optimal front of parameters
which produced the smallest error increase for a given CPU speedup. Finally, we train the network
weights “from scratch” (starting from a random initialization) for the Pareto-optimal configurations
with accelerations close to 2x, 3x, 4x. For fractional strides, we use fine-tuning, since it performs
significantly better than training from scratch.

Figure 5: Comparison of whole network perforation (grid and impact mask) with baseline strategies
(resizing the input images, increasing the strides of convolutional layers) for acceleration of CIFAR-10
NIN network. Panels: (a) Original network, (b) After retraining.
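
The Pareto-optimal front used in the configuration search above can be sketched as a simple filter over measured (speedup, error increase) pairs. The function name `pareto_front` and the tuple layout are assumptions made for illustration. A configuration is kept when no other configuration is at least as fast with an error increase that is no larger.

```python
def pareto_front(points):
    """Keep the (speedup, error_increase) pairs that are not dominated by any other pair."""
    front = []
    # scan from the fastest configuration to the slowest; for equal speedups the lowest error comes first
    for speedup, error in sorted(points, key=lambda p: (-p[0], p[1])):
        # a slower configuration is only useful if it has a strictly lower error than all faster ones
        if not front or error < front[-1][1]:
            front.append((speedup, error))
    return front[::-1]                         # report in order of increasing speedup
```

The configurations retrained in the experiments would then be picked from this front near the target accelerations.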

The results are displayed in fig. 5. Impact perforation is the best strategy both for the original
network and after training the network from scratch. Grid perforation is slightly worse. Convolutional 
strides are used in many CNNs, such as AlexNet, to decrease the computational cost of training and 
evaluation. Our results show that if changing the intermediate representations size and training the 
network from scratch is an option, then it is indeed a good strategy. Although more general, fractional 
strides perform poorly compared to strides, most likely because they “downsample” the outputs of a 
convolutional layer non-uniformly, making them hard to process by the next convolutional layer. 

4.3 Whole network results 

We evaluate the effect of perforation of all the convolutional layers of three CNN models. To tune the
perforation rates, we employ the greedy method described in section 3.3. We use twenty perforation
rates. For NIN and AlexNet we use the impact perforation mask. For VGG-16
we use the grid perforation mask as we find that it considerably simplifies fine-tuning. Using more
than one type of perforation masks does not improve the results. Obtaining the perforation rates 
configuration takes about one day for the largest network we considered, VGG-16. In order to 
decrease the error of the accelerated network, we tune the network’s weights. We do not observe any 
problems with backpropagation, such as exploding/vanishing gradients. The results are presented 
in table 3. Perforation damages the network performance significantly, but network weights tuning
restores most of the accuracy. All the considered networks may be accelerated by a factor of two 
on both CPU and GPU, with under 2.6% increase of error. Theoretical speedups (reduction of the 
number of multiplications) are usually close to the empirical ones. Additionally, the memory required 
to store network activations is significantly reduced by storing only the non-perforated output values. 

4.4 Combining acceleration methods 

A promising way to achieve high speedups with low error increase is to combine multiple acceleration 
methods. For this to succeed, the methods should exploit different types of redundancy in the network. 
In this section, we verify that perforation can be combined with the inter-channel redundancy
elimination approach of [28] to achieve improved speedup-error ratios.

We reimplement the linear asymmetric method of [28]. It decomposes a convolutional layer with a
$(d \times d \times S \times T)$ kernel (height-width-input channels-output channels) into a sequence of two layers,
$(d \times d \times S \times T') \to (1 \times 1 \times T' \times T)$, $T' < T$. The second layer is typically very fast, so the
overall speedup is roughly $\frac{T}{T'}$. When decomposing a perforated convolutional layer, we transfer the
perforation mask to the first obtained layer.
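
To see where the rough speedup of $T / T'$ comes from, one can count multiplications per output position. This is a back-of-the-envelope sketch that ignores biases and memory traffic. The original layer needs $d^2 S T$ multiplications, while the two decomposed layers need $d^2 S T' + T' T$, so

$$\text{speedup} = \frac{d^2 S T}{d^2 S T' + T' T} = \frac{T}{T'} \cdot \frac{d^2 S}{d^2 S + T}.$$

The second factor is close to one whenever $T \ll d^2 S$. For instance, with $d = 3$ and $S = T = 512$ (the shape of the widest VGG-16 layers), $d^2 S = 4608$ and the factor is $4608 / 5120 = 0.9$, so the speedup is about $0.9 \, T / T'$.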

We first apply perforation to the network and fine-tune it, as in the previous section. Then, we apply 
the inter-channel redundancy elimination method to this network. Finally, we perform the second 
round of fine-tuning with a much lower learning rate of 1e-9, due to exploding gradients. All the
methods are tested at the theoretical speedup level of 4x. When the two methods are combined, the
acceleration rate for each method is taken to be roughly equal. The results are presented in table 4.

Network   Device   Speedup   Mult. ↓   Mem. ↓   Error ↑ (%)   Tuned error ↑ (%)
NIN       CPU      2.2x      2.5x      2.0x     +1.5          +0.4
NIN       CPU      3.1x      4.4x      3.5x     +5.5          +1.9
NIN       CPU      4.2x      6.6x      4.4x     +8.3          +2.9
NIN       GPU      2.1x      3.6x      3.3x     +4.5          +1.6
NIN       GPU      3.0x      10.1x     5.7x     +18.2         +5.6
NIN       GPU      3.5x      19.1x     9.2x     +37.4         +12.4
AlexNet   CPU      2.0x      2.1x      1.8x     +10.7         +2.3
AlexNet   CPU      3.0x      3.5x      2.6x     +28.0         +6.1
AlexNet   CPU      3.6x      4.4x      2.9x     +60.7         +9.9
AlexNet   GPU      2.0x      2.0x      1.7x     +8.5          +2.0
AlexNet   GPU      3.0x      2.6x      2.0x     +16.4         +3.2
AlexNet   GPU      4.1x      3.4x      2.4x     +28.1         +6.2
VGG-16    CPU      2.0x      1.8x      1.5x     +15.6         +1.1
VGG-16    CPU      3.0x      2.9x      1.8x     +54.3         +3.7
VGG-16    CPU      4.0x      4.0x      2.5x     +71.6         +5.5
VGG-16    GPU      2.0x      1.9x      1.7x     +23.1         +2.5
VGG-16    GPU      3.0x      2.8x      2.4x     +65.0         +6.8
VGG-16    GPU      4.0x      4.7x      3.4x     +76.5         +7.3

Table 3: Full network acceleration results. Arrows indicate increase or decrease in the metric.
Speedup is the wall-clock acceleration. Mult. is a reduction of the number of multiplications in
convolutional layers (theoretical speedup). Mem. is a reduction of memory required to store the
network activations. Tuned error is the error after training from scratch (NIN) or fine-tuning (AlexNet,
VGG-16) of the accelerated network’s weights.


Perforation   Asymm. [28]   Mult. ↓   Mem. ↓   Error ↑ (%)   Tuned error ↑ (%)
4.0x          -             4.0x      2.5x     +71.6         +5.5
-             3.9x          3.9x      0.93x    +6.7          +2.0
1.8x          2.2x          4.0x      1.4x     +2.9          +1.6


Table 4: Acceleration of VGG-16, 4x theoretical speedup. First row is the proposed method, the
second row is our reimplementation of linear asymmetric method of Zhang et al. [28], the third row
is the combined method. Perforation is complementary to the acceleration method of Zhang et al.


While the decomposition method outperforms perforation, the combined method is better than
both of the components.

5 Conclusion 

We have presented PerforatedCNNs which exploit redundancy of intermediate representations of 
modern CNNs to reduce the evaluation time and memory consumption. Perforation requires only a 
minor modification of the convolution layer and obtains speedups close to theoretical ones on both 
CPU and GPU. Compared to the baselines, PerforatedCNNs achieve lower error, are more flexible 
and do not change the architecture of a CNN (number of parameters in the fully-connected layers 
and the size of the intermediate representations). Retaining the architecture makes it easy to plug
PerforatedCNNs into the existing computer vision pipelines and only perform fine-tuning of the
network, instead of complete retraining. Additionally, perforation can be combined with acceleration
methods which exploit other types of network redundancy to achieve further speedups. 

In the future, we plan to explore the connection between PerforatedCNNs and visual attention by
considering input-dependent perforation masks that can focus on the salient parts of the input.
Unlike recent works on visual attention, which consider rectangular crops of an image,
PerforatedCNNs can process non-rectangular and even disjoint salient parts of the image by choosing 
appropriate perforation masks in the convolutional layers. 

Acknowledgments. We would like to thank Alexander Kirillov and Dmitry Kropotov for helpful 
discussions, and Yandex for providing computational resources for this project. This work was 
supported by RFBR project No. 15-31-20596 (mol-a-ved) and by Microsoft: Moscow State University 
Joint Research Center (RPD 1053945). 

Figure 6: Comparison of different interpolation strategies for perforated pixels, AlexNet network.
Panels: (a) conv2, nearest neighbor, (b) conv2, replace with zero, (c) conv2, barycentric,
(d) conv3, nearest neighbor, (e) conv3, replace with zero, (f) conv3, barycentric.


A Interpolation strategy 

In the paper, perforated values are interpolated using the value of the nearest neighbor. We compare 
this strategy to two alternatives: replacing with a constant zero and barycentric interpolation. For the 
second option, we perform Delaunay triangulation of the non-perforated points set. If a perforated 
point is in the interior of a triangle, then it is interpolated by a weighted sum of the values of the three 
vertices, with barycentric coordinates used as weights. Exterior perforated points are simply assigned 
the value of the nearest neighbor. 
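
The barycentric step for one perforated point inside one triangle can be sketched as below. This is a hedged illustration. It assumes the enclosing triangle has already been found (the Delaunay triangulation and point location are omitted). The function names are hypothetical, and the weights follow the standard barycentric coordinate formula rather than any code from the paper.

```python
import numpy as np

def barycentric_weights(p, a, b, c):
    """Barycentric coordinates of point p with respect to the triangle (a, b, c)."""
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    det = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)    # twice the signed triangle area
    wa = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / det
    wb = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / det
    return wa, wb, 1.0 - wa - wb

def interpolate(p, verts, values):
    """Weighted sum of the exact outputs at the three vertices (values: 3 x T array)."""
    w = np.array(barycentric_weights(p, *verts))
    return w @ np.asarray(values)
```

Each interpolated value reads three stored outputs instead of one, which is the extra memory-access cost that the comparison with nearest-neighbor interpolation weighs against the error.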

The results of comparison on AlexNet are presented in figure 6. We measure theoretical speedup
(reduction of number of multiplications) to ignore the differences in implementations of the interpolation
schemes. Replacing the missing values with zero is clearly not sufficient for successful
acceleration of conv3 layer. Compared to the nearest neighbor, barycentric interpolation slightly 
improves results for pooling structure mask in conv2 and grid interpolation mask in the conv3 layer, 
but performs similarly or worse in other cases. Overall, nearest neighbor interpolation provides a 
good trade-off between complexity of the method (number of memory accesses per interpolated 
value) and the achieved error. 


B Single layer results for NIN network 

In section 4.1 we have considered single-layer acceleration of conv2 and conv3 layers of AlexNet.
Here we present additional results for acceleration of the three non-1x1 convolutional layers of NIN
network. Each convolutional layer is followed by two 1x1 convolutions (which we treat as a part of
non-linearity) and a pooling operation. Therefore, pooling structure mask applies to all layers. The
results are presented in figure 7. We observe a similar pattern to the one observed in AlexNet conv2
and conv3 layers: grid and impact perforation masks perform best. 

C Empirical and theoretical speedups 

As noted in Denton et al. [6], achieving empirical speedups that are close to the theoretical ones
(reduction of the number of multiplications) is quite complicated. We find that our method generally
achieves this, see table 5. For example, for theoretical speedup 4x, AlexNet conv2 empirical
acceleration is 3.8x for CPU and 3.5x for GPU. The results are below the theoretical speedup in
almost all cases due to the additional memory accesses required. The perforation mask type does
not seem to affect the speedup. The difference between the empirical speedups on CPU and GPU
highlights that it is important to choose per-layer perforation rates for the target device.

Figure 7: Acceleration of a single layer of CIFAR-10 NIN network for different mask types without
fine-tuning. Values are averaged over 5 runs.

          NIN            AlexNet        VGG-16
          CPU    GPU     CPU    GPU     CPU    GPU
conv1     4.4x   2.7x    3.2x   2.7x    2.5x   2.2x
conv2     3.8x   3.5x    3.3x   3.0x    2.6x   2.1x
conv3     3.7x   3.3x    4.1x   3.7x    3.2x   2.5x
conv4     -      -       3.9x   3.5x    3.1x   2.6x
conv5     -      -       3.6x   3.4x    3.5x   2.8x
conv6     -      -       -      -       3.5x   2.9x
conv7     -      -       -      -       3.4x   2.9x
conv8     -      -       -      -       3.6x   3.6x
conv9     -      -       -      -       3.6x   3.7x
conv10    -      -       -      -       3.6x   3.7x
conv11    -      -       -      -       3.7x   3.6x
conv12    -      -       -      -       3.7x   3.6x
conv13    -      -       -      -       3.8x   3.6x

Table 5: Per-layer empirical speedups for uniform perforation mask with r = 0.75. Theoretical
speedup is 4x in all cases. Results are averaged over 5 runs.

D Implementation details 

A convolutional layer is typically applied to each image of the mini-batch sequentially. Fig. 8
shows the number of multiplications per second achieved by a quad-core Intel CPU and NVIDIA
GeForce GTX 980 GPU on the bottleneck operation of evaluation of the convolutional layer: matrix
multiplication of the data matrix M by the kernel matrix K. We see that increasing the perforation
rate reduces the efficiency of the operation, especially for GPU, which is as expected: GPUs work
best for large inputs. Thus, for a fair comparison with the non-accelerated implementation, we stack
images of the mini-batch, to match the size of the original data matrix. This requires a tensor
transpose operation after the matrix multiplication, but we find that this operation is comparatively
fast. The same idea is used in MShadow library. We also perform stacking of images for the
baseline methods (resize, stride and fractional stride).

Figure 8: The efficiency of matrix-by-matrix multiplication (measured in multiplications per second)
of the data matrix M by the kernel matrix K, for different perforation rates. AlexNet, the conv2 layer.